Co-evolution-based machine learning predicts functional interactions between human genes | Nature Communications

2021-11-16 07:45:03 By : Ms. Macy S Lee


Nature Communications Volume 12, Article Number: 6454 (2021)

Over the next decade, more than one million eukaryotic species are expected to be fully sequenced. This may improve our understanding of genotype and phenotype crosstalk, gene function and interactions, and help answer evolutionary questions. Here, we develop a machine learning method that uses phylogenetic profiles across 1,154 eukaryotic species. The method integrates co-evolution across eukaryotic clades to predict functional interactions between human genes and the context of these interactions. We benchmark our method and show a 14% performance improvement (auROC) over previous methods. Using this method, we predict functional annotations for less-studied genes. We focus on DNA repair and verify that 9 of the top 50 predicted genes have been identified elsewhere, while others were previously prioritized by high-throughput screens. Overall, our method enables better annotation of function and functional interactions and helps in understanding the evolutionary processes underlying co-evolution. The manuscript is accompanied by a web server available at https://mlpp.cs.huji.ac.il.

The genomic revolution has led to the sequencing of thousands of species, with more sequenced every year. This explosive growth of genomic data from different species can be readily analyzed with comparative genomics methods to study the crosstalk between genes, functions, traits, and the species that harbor them. One such method is phylogenetic profiling (PP), an established approach for identifying functionally related genes and protein-protein interactions (PPI) in prokaryotes and eukaryotes1,2,3,4,5,6,7,8,9,10. PP is based on the assumption that functionally related genes are subject to similar evolutionary pressures and are therefore lost or retained together throughout evolution. For example, by comparing the proteomes of non-ciliated organisms with those of species with prototypical or modified cilia, genes associated with cilia can be identified and classified4,11. Similarly, others have successfully identified mitochondrial genes based on the evolutionary patterns of loss and retention of mitochondrial genes across species5,12.

In recent years, as the number of sequenced organisms has continued to grow, it has become possible to apply co-evolution analysis at the clade level (for example, animals, mammals, and fungi) and to compare signals at different evolutionary scales. The hypothesis is that functionally related genes may show different co-evolutionary patterns in specific clades. This may happen when genes become functionally related at a later stage of evolution (for example, when the function was first introduced in the last common ancestor of certain clades). Examining co-evolution at the clade level may also reveal other, more intricate co-evolutionary processes. For example, a group of genes that functionally interact in a common ancestor may lose their co-evolution in some, but not all, child clades further down the tree.

To better capture co-evolution, clade-wise phylogenetic profiling methods have been developed3,7,10. Shin and Lee showed how integrating phylogenetic profiles across the domains of life can improve the prediction of functionally interacting genes. Subsequently, Sherill-Rofe and Rahat et al.3 identified DNA repair-related genes by integrating co-evolution signals from seven clades, demonstrating the applicability of such methods. However, these methods are either low resolution (focusing on the domains of life) or suitable only for predicting gene sets. It has recently been demonstrated that clade-wise co-evolution can improve the prediction of functional interactions between human genes at the level of eukaryotic clades. In addition, the performance of classical phylogenetic profiling has been shown to saturate as more species are added. In contrast, clade-wise phylogenetic profiling has the potential to keep improving as more species are sequenced.

It has previously been hypothesized that different types of pathways (such as metabolism and signal transduction) may co-evolve in different ways. Some biological processes may change more easily than others, leading to a variety of co-evolutionary patterns. For example, over the course of evolution, signaling pathways may rewire more frequently than metabolic pathways, resulting in different patterns in the phylogenetic profiles of different pathway types14. This illustrates how the context of an interaction, in this case the pathway type, can be inferred directly from co-evolution patterns. Clade-wise signals can therefore potentially be used to improve both the prediction of functionally related genes and the prediction of the interaction context.

A particular application of phylogenetic profiling is assigning functions to less-studied genes. Many human genes are largely uncharacterized and are collectively termed the "Ignorome". Pandey et al.15 studied the brain Ignorome and found that about 70% of studies involved the top 5% of the most studied genes, while 20% of genes were hardly mentioned in the literature. Others have found similar patterns in other Ignoromes and asked why some genes are highly studied while others are usually overlooked. These works identify the time of first description as the most prominent factor (i.e., a rich-get-richer phenomenon)16,17. Recently, the neXtProt consortium identified approximately 2,000 human genes with no known function. Phylogenetic profiling provides an unbiased method of gene function annotation, allowing us to better understand the function of Ignorome genes.

Phylogenetic profiling also carries evolutionary insights into how pathways evolve. Previous work derived such insights from phylogenetic profiles by linking biological traits to groups of co-evolving genes. This is based on the hypothesis that the coordinated loss of functionally related genes in particular organisms indicates that these organisms have undergone major phenotypic changes. For example, the absence of heme biogenesis genes in ticks and parasitic nematodes is related to their adaptation to environmental sources of heme2. Similarly, Li et al.5 analyzed mitochondrial genes that are known to be missing from the mitochondrial genome, and Dey and others examined the cilia pathway in ciliated and non-ciliated organisms4,5,11. However, these findings were either driven by manual inspection2,4 or limited by the complexity of the model5,19. At the macro level of pathway co-evolution, Dey et al.14 determined which types of pathways can be identified through phylogenetic profiles, characterized pathway types by divergence and evolutionary age, and related co-evolution to these general functions.

Here, we propose a supervised machine learning method for phylogenetic profiling that uses the clade-wise co-evolution of functionally related genes. The method predicts functional interactions between human genes and the interaction context (i.e., the biological function) in which each functional interaction occurs. We then extend the method to annotate gene function, focusing on the annotation of less-studied genes. Based on the predictions for each pathway type, we prioritize the functions and interaction partners of these genes, and give specific examples of DNA repair candidate genes that are supported by existing evidence in the literature. Finally, we examine the evolutionary insights revealed by our method at the level of single pathways, pathway types, and, at the macro level, all pathways. These insights led us to identify the importance of parasitic species for the predictions of our method, and potentially for other phylogenetic profiling methods. We explore this phenomenon and show how it manifests in the loss of multiple biological functions in parasitic clades. The paper is accompanied by a web server that allows users to explore the functional interaction predictions for all human genes. It offers three types of analysis corresponding to those in the paper: functional interactions of individual genes, functional interactions of gene sets, and functional annotation of genes. The web server can be accessed at https://mlpp.cs.huji.ac.il

Clade-wise phylogenetic profiling (PP) takes into account the co-evolution of genes at different evolutionary scales, from the kingdom to the species level3,7,10. In addition, studies have shown that different pathway types may exhibit different co-evolutionary patterns. For example, metabolic pathways are more conserved throughout evolution, whereas signaling pathways are often rewired14. We therefore set out to use clade-wise PP both to improve the predictive power of PP and to predict the interaction context, by developing a supervised machine learning method (machine-learning-based phylogenetic profiling, MLPP). The method integrates phylogenetic profiling signals from 49 clades of 1,154 species spanning the eukaryotic tree of life. Because the tree of life is hierarchical, the clades used for the analysis were selected to cover the entire eukaryotic space while reducing overlap and collinearity between input features (see "Methods", Supplementary Table 1, Supplementary Figure 1).

We first calculated a gene-by-species matrix representing the sequence similarity between each human gene and its ortholog in each species, so that a given row contains the phylogenetic profile of a single gene (see "Methods"). Then, for each clade, we calculated the covariance between the phylogenetic profiles of each pair of genes as a feature for the machine learning algorithm. Each gene pair is therefore described by 49 features, one covariance per clade. We then trained a binary classification model to predict functional interactions between gene pairs, defined as co-occurrence in any Reactome pathway20. We used the same 49 features to train additional models, one per interaction context. The interaction context refers to the way and function in which genes are functionally related, and is defined here as the co-occurrence of genes in a certain pathway type (12 Reactome top-level pathways, such as metabolism and immune system, and 28 high-level Gene Ontology terms) or co-occurrence in a protein complex in Reactome (see "Methods").
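The feature construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the species names, the two clades, and the similarity scores are hypothetical toy values, and the sample covariance is used as one reasonable reading of "covariance between the phylogenetic profiles".

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def clade_features(profile_a, profile_b, clades):
    """Return one covariance feature per clade for a gene pair.

    profile_a, profile_b: dict mapping species -> similarity score
    clades: dict mapping clade name -> list of member species
    """
    features = {}
    for clade, species in clades.items():
        xa = [profile_a[s] for s in species]
        xb = [profile_b[s] for s in species]
        features[clade] = covariance(xa, xb)
    return features

# Hypothetical toy data: two clades of three species each.
clades = {"Fungi": ["sp1", "sp2", "sp3"], "Nematoda": ["sp4", "sp5", "sp6"]}
gene_a = {"sp1": 1.0, "sp2": 0.0, "sp3": 1.0, "sp4": 0.5, "sp5": 0.5, "sp6": 0.5}
gene_b = {"sp1": 0.9, "sp2": 0.1, "sp3": 0.9, "sp4": 0.2, "sp5": 0.8, "sp6": 0.5}

features = clade_features(gene_a, gene_b, clades)
print(features)
```

In this toy case the two genes are lost together in one fungal species, so the Fungi feature is positive, while the Nematoda feature is zero because the first gene does not vary there. With 49 clades, the same function would yield the 49-dimensional feature vector of a gene pair.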

We compared the performance of several machine learning algorithms and positive-unlabeled learning frameworks21,22,23,24, and selected a random forest classification algorithm (similar to Claesen et al.23) based on its performance and robustness to unlabeled data (see "Methods", Supplementary Methods, Supplementary Figures 2 and 3). To determine the added benefit of using clades rather than arbitrary sets of organisms, we compared real clades with random clades. The comparison shows that the tree structure and clade-specific evolution are indeed important for the performance of the method (Supplementary Text 1, Supplementary Table 2). The method is also robust to the choice of BLAST preprocessing (see "Methods", Supplementary Table 3).

To test the performance of our method, we compared it with four established PP methods: normalized phylogenetic profiling (NPP)1, SVD-Phy25, PrePhyloPro (PPP)26, and the Hamming distance on binarized phylogenetic profiles (BPP)9,26. These four methods do not consider clades and are based only on a similarity measure between the phylogenetic profiles of genes. We show that our method, trained on functional interactions from Reactome, outperforms the other methods in auROC (Figure 1A), partial auROC (FPR < 0.1), and average precision (Supplementary Table 4, Supplementary Figure 5), with gains of 14%, 3%, and 10%, respectively, over the next-best method.
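For readers unfamiliar with the two ROC-based metrics, the sketch below computes the full auROC and a partial auROC restricted to FPR < 0.1 with the trapezoidal rule. The labels and scores are hypothetical, and the partial area is left unnormalized; whether and how the paper rescales it is not stated in this section.

```python
def roc_points(labels, scores):
    """Return the ROC curve as (FPR, TPR) points, grouping tied scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative examples")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every example sharing the current score threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical gene-pair labels (1 = interacting) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(labels, scores)
full_auc = auc(points)
partial_auc = auc(points, max_fpr=0.1)
print(full_auc, partial_auc)
```

The partial auROC emphasizes the top-ranked predictions, which matters when only a small fraction of all gene pairs can be followed up experimentally.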

Receiver operating characteristic (ROC) curves and area under the curve (AUC) are compared for our model (MLPP) and other phylogenetic profiling methods in predicting functionally interacting gene pairs. In addition, MLPP can predict the interaction context: complex co-occurrence, or one of the 12 top-level pathways from Reactome. When compared in 5-fold cross-validation, the model outperforms other methods in predicting functional interaction (A, B) and interaction context (B). The error bars represent the 95% confidence interval from 1000 bootstrap samples. MLPP: machine learning phylogenetic profiling; NPP: normalized phylogenetic profiling; SVD-Phy: singular value decomposition phylogenetic profiling; PPP: PrePhyloPro; Hamming: binarized phylogenetic profiling using Hamming distance. The source data is provided as a source data file.

Because several biases may confound this comparison, we performed stratified cross-validation in addition to random assignment to training and test splits. Previous studies have found that functional interaction prediction models tend to overfit to genes that appear in pairs in both the training and test sets (though not in the same pair). The gene pairs in the test set were therefore stratified by whether both genes, only one gene, or neither gene appears in the training set, as previously suggested (see "Methods", Supplementary Figure 5, Supplementary Table 4). In addition, genes with high sequence similarity (such as paralogs) are often functionally related and co-evolve. However, this relationship is easy to capture without PP, which leads to optimistic results when predicting functional interactions. We therefore also stratified the gene pairs by this property (see "Methods"). When gene pairs with high sequence similarity are filtered out, the performance difference is more pronounced (Supplementary Table 4, Supplementary Figure 5). Another consideration is gene age. Newer genes (i.e., those first appearing in a common ancestor closer to humans) may prove more difficult for co-evolution-based methods. This can be attributed to the greater similarity between closely related organisms, which results in high phylogenetic profile similarity between these genes regardless of function. We therefore stratified by gene age and showed that model performance is indeed reduced for the subset of genes found only in metazoans and chordates (see "Methods", Supplementary Figure 4). However, these gene pairs constitute only a small part of Reactome's functional interactions (1% metazoan-specific and 0.5% chordate-specific, mutually exclusive) and contain a high proportion of paralogous pairs (20% for Metazoa and 17% for Chordata, compared with 5% across all genes).
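The first stratification described above (both genes, one gene, or neither gene of a test pair seen in training) can be sketched as follows. The gene names are hypothetical placeholders and the stratum labels are illustrative choices, not the paper's terminology.

```python
def stratify_pair(pair, train_genes):
    """Label a test pair by how many of its genes occur in the training set."""
    seen = sum(gene in train_genes for gene in pair)
    return {2: "both", 1: "one", 0: "none"}[seen]

def stratify_test_pairs(train_pairs, test_pairs):
    """Group test pairs into the three strata."""
    train_genes = {gene for pair in train_pairs for gene in pair}
    strata = {"both": [], "one": [], "none": []}
    for pair in test_pairs:
        strata[stratify_pair(pair, train_genes)].append(pair)
    return strata

# Hypothetical gene pairs.
train_pairs = [("A", "B"), ("B", "C")]
test_pairs = [("A", "C"), ("A", "D"), ("D", "E")]

strata = stratify_test_pairs(train_pairs, test_pairs)
print(strata)
```

Reporting metrics separately per stratum shows whether a model has learned something about pairs or has merely memorized which individual genes tend to interact; the "none" stratum is the strictest test of generalization.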

In addition to functional interaction, our model also predicts the interaction context for each gene pair. Previous studies have shown that interactions belonging to different interaction contexts may show different global phylogenetic profiles. The interaction context provides additional information about the functional interaction of a gene pair, such as the pathway type. We show that our method outperforms other PP methods in predicting pathway types from Reactome and high-level terms from GO, and achieves high auROC, partial auROC, and average precision in cross-validation and under stratification (Figure 1B shows the comparison with NPP; for other comparisons, see the Supplementary Information).

Further comparisons on a time split and on external validation databases revealed similar performance improvements. We evaluated how well the functional interaction model trained on Reactome (February 2019) predicts the functional interactions of a later version of Reactome (January 2021). Our model is robust to these temporal changes (Supplementary Figure 6). In addition, we externally validated the functional interaction model, trained on Reactome's functional interactions, by predicting functional interactions from data sets both similar to and different from Reactome. Our model is robust in predicting PPIs from BioGRID, the Biological General Repository for Interaction Datasets28 (Supplementary Figure 7A-D), and from IntAct, the EMBL-EBI Molecular Interaction Database29 (Supplementary Figure 7E-H); functional interactions from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Supplementary Figure 8); and protein complex co-occurrence from the CORUM31 (Supplementary Figure 9A-D) and IntAct Complex29 (Supplementary Figure 9E-H) databases. For protein complex co-occurrence, we also compared the PP methods with the "In Complex" interaction context model trained on complex co-occurrence in Reactome (Supplementary Figure 10). These external validations are robust for each data set, both on the entire data set and when excluding functional interactions found in Reactome and functional interactions between paralogous genes.

Since phylogenetic profiling is often used to understand functional interactions at the pathway level, we compared the different methods at this level. For each pathway, we calculated the pairwise scores of all gene pairs in the pathway. To allow comparison between methods, the scores of all gene pairs for each method were normalized by converting them to percentiles. Comparing the median percentile of each pathway over all KEGG pathways, MLPP (functional interaction model) outperforms the NPP method in 77.5% of cases, and identifies 43.8% of the pathways at the 95th percentile level (Figure 2A, Supplementary Figure 11A). For example, for the KEGG30 pathway fatty acid metabolism, MLPP predicts the pairwise interactions at a higher percentile (Figure 2B; the redder, the higher the percentile) than the BPP (Figure 2C) and NPP (Figure 2D) methods. The KEGG valine, leucine and isoleucine pathway shows a similar comparison (Figure 2E-G). Substantial improvements are also seen when accounting for sequence similarity (Supplementary Figure 11C), when comparing with the BPP method (Supplementary Figure 11B, D), and on the CORUM complex database (Supplementary Figure 11E-H).
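The percentile normalization and per-pathway median described above can be illustrated with a short sketch. The gene names and scores are hypothetical, and the percentile convention (fraction of pairs scoring at or below a given pair) is an assumption; the exact convention used in the paper is in its Methods.

```python
from bisect import bisect_right
from statistics import median

def to_percentiles(scores):
    """Map each pair's raw score to its percentile (0-100) among all pairs."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {pair: 100.0 * bisect_right(ordered, s) / n
            for pair, s in scores.items()}

def pathway_median_percentile(percentiles, pathway_genes):
    """Median percentile over pairs whose two genes both belong to the pathway."""
    inside = [p for (a, b), p in percentiles.items()
              if a in pathway_genes and b in pathway_genes]
    return median(inside)

# Hypothetical pairwise scores from one method.
scores = {
    ("g1", "g2"): 0.9,
    ("g1", "g3"): 0.8,
    ("g2", "g3"): 0.7,
    ("g1", "g4"): 0.2,
    ("g3", "g4"): 0.1,
}
pathway = {"g1", "g2", "g3"}

percentiles = to_percentiles(scores)
pathway_score = pathway_median_percentile(percentiles, pathway)
print(pathway_score)
```

Because percentiles depend only on the rank of a score within its own method, the resulting pathway-level medians can be compared across methods whose raw scores live on different scales (probabilities, correlations, Hamming distances).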

The MLPP model is compared with the NPP model in predicting pathways, using the median percentile of the scores within each pathway in the KEGG database (A). Specific examples are given for the fatty acid metabolism (B-D) and valine, leucine and isoleucine degradation (E-G) pathways, comparing MLPP (B, E), BPP (C, F) and NPP (D, G). MLPP: machine learning phylogenetic profiling; NPP: normalized phylogenetic profiling; BPP: binarized phylogenetic profiling using Hamming distance. For D-I, the color represents the percentile, the scores for each method are in parentheses, and the red dots are paralogous proteins. The source data is provided as a source data file.

We then applied our method to identify modules of functionally interacting genes. To this end, we clustered genes by their predicted functional interactions (that is, the predicted interaction probability of every gene pair) and extracted tightly interconnected modules through hierarchical clustering (see "Methods"). We identified many modules of known functionally interacting genes (Figure 3). Some modules, such as cilia5,14 and heme biogenesis2 genes, were previously described as highly co-evolved (Figure 3B, G, respectively). However, we also identified clusters whose signal is present only in a subset of clades, and which are therefore found more readily by our method. For example, the mitochondrial respiratory complex III and IV genes (Figure 3C) and NADH dehydrogenase (Figure 3D) have strong co-evolution signals in fungi. Other clusters, such as the B12 metabolism cluster (Figure 3E) and the histidine catabolism cluster (Figure 3F), show signals in both fungi and nematodes. Finally, the cluster in Figure 3A contains genes related to mRNA splicing together with some genes not previously linked to splicing (red). Many of the modules found contain mainly genes with high sequence similarity (Supplementary Figure 12), such as alcohol dehydrogenases (A), receptor/ion channel subunits (B, D, G), ribosomal subunits (E, F), exosomes (C), collagen subunits (H) and histones (I). As mentioned earlier, these modules are expected, because genes with high sequence similarity are highly co-evolved and are usually functionally related.

The predictions of the functional interaction model were clustered using hierarchical clustering, and the tree was cut at a specific height to generate clusters. For each cluster A-G, the top panel shows the importance of each clade, calculated as the average SHAP value, with species ordered from closest to humans (left) to most distant (right). The bottom panel shows the phylogenetic profile, given as the bit score normalized by the self-hit. The panels correspond to mRNA splicing (A), cilia (B), mitochondrial respiratory complexes III and IV (C), NADH dehydrogenase (D), B12 metabolism (E), histidine metabolism (F), and heme biogenesis (G). The red entries in A indicate genes not previously known to be involved in mRNA splicing.

Next, we applied our method to predict the biological functions of less-studied genes. Many genes (termed the "Ignorome") are poorly studied and rarely even mentioned in the literature15, which makes understanding their function challenging. Our systematic and unbiased method does not rely on additional data and generalizes well to genes found only in the test set (see "Methods"), and can therefore help capture the function of the Ignorome. We focused on genes that lack functional annotation in UniProt32. These genes were also required either to be in the bottom 20% by PubMed mentions (fewer than 10 papers) or to belong to the ~2,000 genes identified by the neXtProt consortium as having unknown function (Supplementary Figure 13).

To predict gene function, we use a random-walk-based gene prioritization, which we call "PathScore". Using the MLPP method described above, we generated a complete network of predicted functional interactions for each interaction context (pathway type) model. We then scored genes by the equilibrium distribution of a random walk on the network (see "Methods"). The score of each gene reflects its connectivity in the predicted functional interaction network and indicates its importance to the pathway type (see "Methods"). For DNA repair, PathScore ranks genes known to belong to this pathway type at the top (Figure 4A), and is robust across training and test splits in multiple cross-validations (Figure 4B; similar analyses for the remaining interaction context models can be found in Supplementary Figures 14 and 15). By examining less-studied genes, we identified dozens of genes among the top 250 PathScore ranks of each pathway type, and generated one or more annotations for 238 Ignorome genes (Figure 4C, Supplementary Data 3).
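The idea of scoring genes by the equilibrium distribution of a random walk can be sketched as below. This is a generic illustration under stated assumptions, not the PathScore implementation: it uses a plain (lazy) random walk on a small undirected weighted network with hypothetical gene names and edge weights, and it omits any seeding, restart, or label information that the paper's Methods may specify.

```python
def stationary_distribution(weights, n_iter=1000, tol=1e-12):
    """Equilibrium distribution of a lazy random walk on a weighted graph.

    weights: dict node -> dict neighbor -> symmetric edge weight, for example
    predicted interaction probabilities. Every node needs at least one edge.
    """
    nodes = sorted(weights)
    degree = {u: sum(weights[u].values()) for u in nodes}
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(n_iter):
        # Lazy step: stay put with probability 0.5 to guarantee convergence.
        nxt = {u: 0.5 * p[u] for u in nodes}
        for u in nodes:
            for v, w in weights[u].items():
                nxt[v] += 0.5 * p[u] * w / degree[u]
        delta = max(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical predicted interaction probabilities between four genes.
edges = {("G1", "G2"): 0.9, ("G1", "G3"): 0.8, ("G2", "G3"): 0.7, ("G3", "G4"): 0.1}
weights = {}
for (a, b), w in edges.items():
    weights.setdefault(a, {})[b] = w
    weights.setdefault(b, {})[a] = w

path_scores = stationary_distribution(weights)
ranking = sorted(path_scores, key=path_scores.get, reverse=True)
print(ranking, path_scores)
```

For a plain walk on a connected undirected graph, the equilibrium probability of a node is proportional to its total edge weight, so well-connected genes rank highest and the weakly connected G4 ranks last. A scheme that restarts at known pathway genes would instead favor genes close to those seeds.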

PathScore is a random-walk-based metric used to identify the importance of genes in a given network. PathScore is calculated using the labels of all genes, and prioritizes unknown genes (gray) that do not belong to the pathway but whose PathScores are equal to or higher than those of known genes (red). Shown for the pathway type DNA repair (A). PathScore is higher for genes belonging to the specific pathway and generalizes to pathway genes found only in the test set. Shown for DNA repair (B, 240 known genes). Similar performance measures for the other pathway types considered appear in Supplementary Figures 14-15. The box plot extends from the lower quartile of the data to the upper quartile, with an orange line at the median. The whiskers extend to 1.5 times the interquartile range. The top five less-studied genes (no function annotation in UniProt, and either few PubMed mentions or membership in neXtProt's uncharacterized gene set) are selected for each pathway type from Reactome and GO and shown in descending order of PathScore (C). Genes are annotated by whether they appear in neXtProt's set of uncharacterized genes (shown by dots), the number of PubMed mentions ("#Pubmed"), and the number of associated GO terms ("#GO"). GO: Gene Ontology. The source data is provided as a source data file.

Specifically, for the DNA repair pathway type, we identified several potentially functionally related genes. Among the top 50 genes ranked for DNA repair, 9 are annotated as DNA repair genes in Reactome (22 of the top 200). In addition, we identified nine genes known to be related to DNA repair but not found in Reactome. These include EXO5 (ranked 8), an exonuclease33 involved in DNA repair and genome stability; C17orf53 (ranked 16), formerly an Ignorome gene and recently identified as involved in homologous recombination repair34; the DNA polymerase DNTT (ranked 18); the telomerase TERT (ranked 23); the SMC5-6 complex gene SMC5 (ranked 47; EID3, from the same complex, is ranked 55)35; and genes previously prioritized as related to DNA repair: ELP6 (ref. 36; ranked 33), PIGN (ref. 37; ranked 34), NUDT15 (ref. 38; ranked 36), and STK19 (ref. 39; ranked 46). These serve as strong external validation of the PathScore prioritization method. In addition, 18 of the top 200 genes were prioritized as related to DNA repair by multiple CRISPR screens39 (rank in parentheses): GPATCH8 (3), SCNM1 (7), OMA1 (10), AOC2 (50), RCE1 (71), ALG3 (92), THUMPD1 (111), DPH6 (115), PIGW (119), TYW1 (131), VPS16 (132), PPOX (143), DUSP12 (146), ISCA2 (158), NAALADL2 (187), POLA2 (194).

This approach also highlights hundreds of genes that may be functionally related to several pathway types. In total, we identified 1,554 non-Ignorome genes and 58 Ignorome genes that are in the top 250 of more than one pathway type. For example, the Yippee-type proteins YPEL1, YPEL2, and YPEL4 rank high in cell cycle, disease, gene expression, homeostasis, protein metabolism, signal transduction, small molecule transport, and vesicle-mediated transport, with YPEL1 and YPEL2 ranked among the top 250 of these eight pathway types. Yippee family proteins are putative zinc fingers known to be related to centromeres.

Phylogenetic profiles can be traced back to generate evolutionary insights into pathway evolution. These insights include important loss events5,12 and analyses of different pathway types14. Our method achieves similar evolutionary inference by calculating the contribution of each clade (feature) to the predicted functional interaction of a gene pair. The SHAP method for tree-based models is used to calculate the contribution of each clade to the prediction41,42. A SHAP value is calculated by considering how the model prediction changes with and without a clade, over all possible combinations of the other clades. For example, for the gene pair ACO1-IDH1 from the citric acid cycle (TCA), the predicted probability of functional interaction is 0.87. SHAP decomposes this probability into 0.117 for Fungi, 0.06 for Chromadorea, -0.01 for ascomycetes, and so on (the bias term is 0.427, Figure 5A; clades with absolute SHAP values below 0.002 are not shown). The interpretation of these values is conceptually similar to that of the coefficients and intercept (bias) of a linear model. Evolutionary inference is done at the clade level and therefore cannot point to the time of a specific loss event, as described previously, for example in ref. 5. Nevertheless, it can reveal the clade in which a loss event likely occurred, the first introduction of the pathway, or the loss of co-evolution at the level of a common ancestor. In addition, our model allows a unified assessment of these evolutionary insights across gene pairs, pathways, and pathway types. We therefore present insights into functional interaction and pathway evolution at these three levels.
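To make the "with and without a clade, over all possible combinations" idea concrete, the sketch below computes exact Shapley values for a toy model with three clade features. Everything here is hypothetical: the toy model, its numbers, and the clade subset are invented for illustration and are unrelated to the ACO1-IDH1 values above. The SHAP method for tree models computes the same kind of quantity efficiently rather than by brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating all feature subsets.

    value(subset) returns the model output when only the features in the
    frozenset `subset` are available.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of subsets of size k in the Shapley average.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def toy_model(present):
    """Hypothetical interaction probability given the available clades."""
    out = 0.4                                   # bias: no clade information
    if "Fungi" in present:
        out += 0.2
    if "Chromadorea" in present:
        out += 0.1
    if {"Fungi", "Chromadorea"} <= present:
        out += 0.1                              # joint evidence from both clades
    return out

clade_names = ["Fungi", "Chromadorea", "Metazoa"]
phi = shapley_values(clade_names, toy_model)
bias = toy_model(frozenset())
prediction = toy_model(frozenset(clade_names))
print(phi, bias, prediction)
```

The contributions add up exactly: the bias plus the sum of the per-clade values equals the prediction, which is the property that allows a predicted probability to be split into a bias term and per-clade terms. The joint bonus is shared equally between the two clades that produce it, and the uninformative clade receives zero.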

SHAP values were calculated for all gene pairs in the citric acid cycle from KEGG. A specific example is shown for the gene pair ACO1-IDH1 (A). The bars represent the SHAP value calculated for each clade and are colored by that value. The SHAP bias term is the predicted probability when no clade values are known. The average SHAP value was calculated over all gene pairs (B). Clades with an average SHAP value higher than 0.002 are shown, colored by the average SHAP value. Species are ordered counterclockwise from 0 degrees, from closest to humans to most distant. See the supplementary material for clade abbreviations. All pairwise interactions are shown as networks for Fungi and Chromadorea, the two most important clades by SHAP value (C, D, respectively). The edges are colored as the clades in A. The normalized bit score matrix of the fungal species for all genes in the pathway highlights their loss in the Microsporidia clade (E).

As an example, we focus on the citric acid cycle (TCA) pathway. The model identified Fungi and Chromadorea as the clades with the highest importance for prediction in this pathway (Figure 5B). These clades are complementary in predicting some functional interactions. Although Fungi is the single most informative clade, it fails to predict the interactions of the genes PCK1 and PCK2 with the remaining TCA genes. These interactions are, however, well captured in Chromadorea (Figure 5C, D). This may be related to the function of PCK1/2: these genes control gluconeogenesis from TCA intermediates and are therefore peripheral to the pathway.

Overall, the phylogenetic profiles of TCA genes in fungi show that most genes are conserved throughout the clade, except for the known loss of the pathway in the parasitic Microsporidia (Figure 5E, boxed). Using the importance metric provided by our method, the model thus links the phenotypic change in Microsporidia to a loss event, demonstrating its applicability for deriving evolutionary insights. Two other examples of pathway-level evolutionary insights are methylmalonic acid metabolism and histidine metabolism (Figure 3E and F, respectively, top of each panel). In these pathways, the model identified specific loss events in nematodes through clade importance. Overall, we show that our model can capture specific loss events similar to those found by previous methods5.

Next, we sought insights at a higher level of pathway co-evolution. We first assessed the general information content of clades in predicting functional interactions. Overall, the most important clades for functional interaction are Fungi (average absolute SHAP value of 0.04) and Nematoda (0.022), together with their respective child clades Fungi incertae sedis (0.03) and Chromadorea (0.033) (Figure 6A; clades are ordered from the top by decreasing average absolute SHAP value). Unexpectedly, these specific clades are more important than all eukaryotes taken together (0.018), indicating that specific clades can be more informative, both for our method and for phylogenetic profiling in general.
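The clade ranking by average absolute SHAP value amounts to a simple aggregation over gene pairs, sketched below. The per-pair SHAP values are hypothetical and do not reproduce the numbers reported above.

```python
def clade_importance(shap_rows):
    """Mean absolute SHAP value per clade, most important clade first.

    shap_rows: list of dicts, one per gene pair, mapping clade -> SHAP value.
    """
    clades = shap_rows[0].keys()
    means = {c: sum(abs(row[c]) for row in shap_rows) / len(shap_rows)
             for c in clades}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical SHAP values for two gene pairs and three clades.
shap_rows = [
    {"Fungi": 0.10, "Chromadorea": -0.06, "Eukaryota": 0.01},
    {"Fungi": -0.02, "Chromadorea": 0.04, "Eukaryota": 0.03},
]

importance = clade_importance(shap_rows)
print(importance)
```

Taking the absolute value before averaging matters: a clade that strongly raises the prediction for some pairs and strongly lowers it for others is still informative, even though its signed average could be close to zero.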

Clade importance, based on SHAP values, was calculated on the test set of a single cross-validation fold of the functional interaction model, revealing the clades with the highest average absolute importance (A). The clade importance was averaged for each Reactome pathway and projected to two dimensions using UMAP (B). For each pathway, each quarter of its marker is colored by the average SHAP value of one of four specific clades. The average predicted probability of the gene pairs is shown by the color of the middle circle. The source data is provided as a source data file.

For the interaction contexts, different pathway types show different clade importance (Supplementary Figure 16). The initial expectation was that more "ancient" pathway types would rely more on distant organisms, and vice versa. This expectation is recapitulated in our analysis. For example, the metabolism model assigns higher importance to more distant clades, such as Alveolata, stramenopiles, Fungi, and all eukaryotes, while the immune system model assigns higher importance to clades closer to humans, such as Metazoa and Ecdysozoa. However, some pathway types show counterintuitive clade importance. For example, the signal transduction model assigns high importance to all eukaryotes, although signaling pathways are expected to rewire frequently and therefore to be more informative in organisms closer to humans.

We then examined clade importance at the level of gene pairs and pathways. To this end, UMAP44 was used to project the mean SHAP values of each pathway. This analysis clusters pathways with similar clade importance attributions (Figure 6B). For example, many metabolic pathways (Figure 6B, lower right corner) assign high importance to Chromadorea and Fungi. Likewise, a set of receptor-type, complex, and signaling pathways assign high importance to all eukaryotes and Chromadorea (Figure 6B, upper left corner). These differences highlight the added value of clade-wise phylogenetic analysis, which can detect co-evolution within a subset of the eukaryotic tree. A UMAP projection of the SHAP values of gene pairs confirmed similar insights. Here, clusters of gene pairs with similar clade importance show that the highest-scoring pairs assign high importance to both Fungi and Nematoda (Supplementary Figure 17).

Overall, our method enables the discovery of specific patterns at the level of gene pairs, specific pathways, and all pathways together, thereby shedding light on pathway evolution. These insights fall into two categories: first, identifying clades with gene loss events that translate into meaningful phenotypic effects, and second, revealing the evolutionary processes underlying various pathways. The latter include the gain of each pathway, differences between clades in pathway loss, and the information content of phylogenetic analysis in the various clades for functional interactions in general and for specific pathway types.

Many of the most informative clades described above, such as Chromadorea, Stramenopiles, Alveolata, and Fungi Incertae Sedis, contain a large number of parasites. Parasites are well known to undergo extensive gene loss and to differ greatly from their free-living counterparts45,46,47,48. We therefore hypothesized that these observations are related and that parasitic organisms in the tree of life provide a key signal for phylogenetic analysis.

Parasites (see "Methods", Supplementary Table 5) are generally less conserved relative to humans than free-living organisms (Figure 7A, red). The lowest percentages of orthologs were found in parasites from Alveolata, Microsporidia (Fungi Incertae Sedis, denoted Fungi IS), Kinetoplastida, and intestinal flagellates (Hexamitidae, denoted other eukaryotes) (Figure 7A, red). Only two non-parasitic organisms showed similar levels of loss: a microalga (Nannochloropsis gaditana, in Stramenopiles, marked with a green arrow) and an endosymbiotic kinetoplastid (Perkinsela, marked with a red arrow). For endosymbionts, similar host-adaptation-related causes of gene loss have been described previously.

The fraction of human genes found in each organism is shown on the y-axis, with parasites marked in red. Two non-parasitic organisms with a low fraction of conserved genes are highlighted with green (Nannochloropsis gaditana, in Stramenopiles) and red (Perkinsela, in Kinetoplastida) arrows (A). The fraction of conserved genes was then compared, in six clades with many parasites, between parasitic organisms, non-parasitic organisms, and the reference (parent) clade. A two-sided Mann-Whitney test was used in each clade to compare the parasitic organisms with the reference and with the non-parasitic organisms; p-values are shown for significant comparisons (p < 0.05). Box plots extend from the lower to the upper quartile of the data, with an orange line in the middle. Whiskers extend to 1.5 times the interquartile range (B). Beyond the species level, a comparison was made by pairing the mean conservation of each gene in the parasites (red) and in the reference clade (green), with a line connecting them (C). Genes that are completely lost or poorly conserved in at least one parasitic clade but highly conserved across all eukaryotes were tested for combinations of losses across these clades. Genes in the top 10 intersections were tested for over-representation of Gene Ontology terms (biological process ontology). For each combination, the top five terms by FDR-adjusted p-value are shown (D). The upper panel shows the number of genes in each clade or intersection, with the relevant clades marked by black circles. The lower panel shows the number and significance of the most relevant pathways. FDR: false discovery rate. Source data are provided as a Source Data file.

The six parasite-containing clades were further compared with non-parasitic (free-living or symbiotic) organisms in the same clade and with the reference parent clade (Figure 7B, C; see "Methods"; a complete list of the parasitic organisms can be found in Supplementary Table 5). Parasites showed a statistically significant reduction in conservation levels (relative to humans) compared with both non-parasitic organisms (Nematoda, Alveolata) and the reference clades (all except stratum corneum, two-sided Mann-Whitney test, Figure 7B). In addition, many genes missing in parasites are highly conserved among non-parasitic organisms in the reference clade (Figure 7C).

We were therefore interested in mapping the genes lost in each clade and in how this signal translates across clades and pathways. We analyzed genes that are highly conserved across all eukaryotes but poorly conserved in at least one parasitic clade (see "Methods"), resulting in 4114 genes (Figure 7D). The presence or absence of each gene in each clade was determined, and Gene Ontology (GO) biological process enrichment analysis was performed on all clade combinations (see "Methods"). Parasitic clades often lack orthologs in specific metabolic, signal transduction, and developmental pathways (Figure 7D, the top 10 clade combinations ranked by the number of genes in the intersection). By examining the different combinations, we found that certain pathways are enriched in certain combinations. For example, mRNA splicing genes are lost in both Kinetoplastida and Microsporidia, while GTP signaling genes are lost in the second, third, or eighth combination (from the left), all of which include Microsporidia. Therefore, some pathways can be identified by their patterns of loss in parasites.

However, a sensitivity analysis that excluded parasites from model training produced conflicting evidence. On the one hand, although many of the most important clades contain parasites (as described above), excluding parasites or these clades resulted in only a slight decrease in performance (see "Methods", Supplementary Table 6). On the other hand, clade importance did shift from these clades to other clades (Supplementary Figure 18). This suggests that although parasites do contribute to the model presented in this work, other signals in the phylogenetic analysis can achieve similar performance.

Evolutionary methods are one of the main sources for understanding gene function and interactions1,2,3,4,5,6,8,9,10. Here, we present a machine learning method for phylogenetic analysis and demonstrate its utility in predicting functional interactions. By using clade-wise phylogenetic analysis, our method predicts functional interactions between genes and reveals the evolutionary insights behind the predictions. We applied our method to predict putative functions for less studied genes and to explore the evolutionary signal found in parasites.

In addition to using the entire eukaryotic tree of life, our method extends phylogenetic analysis by using machine learning to recover signals found in individual clades. This clade-wise approach assumes that, for certain functional interactions, the co-evolutionary signal is better captured in a subset of the tree. This may be due to genes that appeared later in evolution (for example, genes specific to metazoans or mammals) or, more interestingly, to genes that work together in only a subset of the tree. The opposite situation may also occur, in which a pathway has broken apart: its genes are found in humans but are not functionally related, and therefore no longer co-evolve, in a particular clade, while still co-evolving in other clades.

Our work goes beyond previous demonstrations of the utility of clade-wise phylogenetic analysis, which were limited in clade resolution or in applicability to the entire interactome3,7,10. We present a full-fledged clade-wise phylogenetic analysis method, combined with supervised machine learning, which substantially improves performance in predicting functional interactions.

In addition, our method provides insight into the importance of clades for specific predictions, thereby identifying potential evolutionary signals. These evolutionary insights are important for understanding how pathways may have evolved and for placing predictions in a broader context. This offers a practical and scalable computational alternative to full phylogenetic tree inference5,19,50. We show evolutionary insights at several levels. At the level of a single pathway, our method identifies clades with strong co-evolutionary signals (such as independent loss events). This is similar to the insights derived from full phylogenetic tree inference. By examining the clade importance of the interaction context models, we recapitulated hypotheses stated in the literature, such as that recently evolved pathway types (for example, immune pathways) assign more importance to metazoan clades. Finally, at the macro level, we examined which clades are most important for prediction and, by proxy, carry the most evolutionary information. Our analysis extends the range of evolutionary insights obtained from phylogenetic analysis, from pairwise functional interactions to mapping evolutionary insights across all pathways together.

Several of the most important clades identified by our method contain parasites. However, an alternative hypothesis can be considered, in which clades with high divergence produce the strongest signals and parasitic species are only one such example. Our analysis excluding parasites from model training lends further support to this. As an illustration, we found that Nematoda is one of the most important clades in the model, and nematodes have previously been shown to be highly divergent in their evolutionary distances regardless of their parasitic status51,52.

Many human genes remain largely uncharacterized. In recent years, this problem has attracted attention, and there has been vigorous discussion of possible solutions15,16,17. One such solution is to characterize these genes using unbiased (or minimally biased) information from mRNA sequencing and other high-throughput experimental methods15,17. Phylogenetic analysis offers an unbiased way to understand gene function through the functional interactions captured by co-evolution. Although not completely unbiased, our method predicts many promising functional annotations for less studied genes, which can be further explored with computational and experimental techniques. For DNA repair in particular, external validation showed that many of the prioritized genes are indeed relevant, which paves the way for studying other top-ranked genes on this list.

However, our method still has some limitations. First, the low performance of our method on young genes limits its use for these genes. This may be due to our method's heavy reliance on more distant clades rather than to a lack of co-evolutionary signal. Second, the supervised nature of our method suggests that it may be less suitable for less studied genes. Although our stratified analyses show that the method is indeed robust in cross-validation and in external validation on unseen genes, some loss of performance was also noted. This may be mitigated in future work by using unsupervised machine learning methods, such as an autoencoder building on the work of SVD-Phy25. Finally, our functional annotation method, PathScore, is intentionally limited to large pathway types and, owing to its supervised nature, is not suitable for small pathways. Modifying PathScore to use other network propagation algorithms, such as methods that allow multiple specific seeds, could make it applicable to smaller pathways, similar to other methods in the literature53.

As more species are sequenced, we believe that using machine learning-based methods to examine evolution systematically and across different evolutionary scales will lead to further improvements in performance and yield new predictions and evolutionary insights. This work highlights the great potential of large-scale genome analysis for understanding genotype-phenotype interactions and the crosstalk between genes, functions, and evolution. This work is accompanied by a web server for exploring the functional interaction and functional annotation predictions in this article, available at https://mlpp.cs.huji.ac.il.

To model the co-evolution of genes, we generated phylogenetic profiles of genes similarly to previous descriptions1,2,3,54. First, we downloaded the proteomes of 1154 species (see Source Data) from UniProt as FASTA files using the programmatic API32 (accessed on 28.12.2018, https://www.uniprot.org/uniprot/?query=proteome:Proteome_ID&format=fasta). To further enrich these proteomes, we used the RefSeq non-redundant protein database53 from NCBI (25.12.2018, ftp://ftp.ncbi.nlm.nih.gov/blast/db/) to extend the FASTA file of each species. We then constructed a reference set of human proteins so that each human gene has one representative protein. We retrieved the human proteome from UniProt (accessed on 25.12.2018) and selected the longest protein corresponding to each human gene as the reference.
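
The choice of one representative protein per gene can be sketched as follows. This is a minimal illustration, assuming the proteome has already been parsed into (gene symbol, protein ID, sequence) records; the function name, the record layout, and the toy identifiers are hypothetical and are not taken from the authors' code.

```python
def longest_protein_per_gene(records):
    """Pick one representative protein per gene: the longest sequence.

    `records` is an iterable of (gene_symbol, protein_id, sequence) tuples
    (a hypothetical layout). On ties, the first record encountered is kept.
    """
    best = {}
    for gene, protein_id, sequence in records:
        current = best.get(gene)
        if current is None or len(sequence) > len(current[1]):
            best[gene] = (protein_id, sequence)
    return best


# Toy records with made-up identifiers and sequences.
toy_records = [
    ("GENE_A", "A_short", "MKT"),
    ("GENE_A", "A_long", "MKTAYIAKQR"),
    ("GENE_B", "B_only", "MSEQ"),
]
reference = longest_protein_per_gene(toy_records)
```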

We used BLAST55 to find, for each human protein, its one-way best hit in each of the 1154 species. To do this, blastp (version 2.7.1) was run from the command line56 with the parameter "-max_target_seqs 1" to keep only the top hit for each gene in each species. The output of this step is a matrix B in which each element Bi,j contains the BLAST bit score between human gene i (represented by its reference protein) and its best hit in species j. Bit scores below a threshold of 60 were set to zero. Supplementary Table 3 provides an analysis of several bit score (40, 60, 100) and E-value (1e-3) thresholds, showing robustness to the chosen threshold. We then divided each row (gene) by the bit score of the human protein's self-hit to account for protein length and for evolutionary distance from the reference organism57.
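
The thresholding and self-hit normalization described above can be illustrated with a short sketch. It assumes the best-hit bit scores are already collected per gene; the function name and the dictionary layout are illustrative assumptions, not the authors' implementation.

```python
def normalize_profile(bit_scores, self_scores, threshold=60.0):
    """Build normalized phylogenetic profiles from best-hit bit scores.

    bit_scores:  dict gene -> list of best-hit bit scores, one per species
    self_scores: dict gene -> bit score of the human protein against itself
    Scores below `threshold` are set to zero, then each row is divided by
    the gene's self-hit score.
    """
    normalized = {}
    for gene, row in bit_scores.items():
        self_hit = self_scores[gene]
        normalized[gene] = [
            (score if score >= threshold else 0.0) / self_hit for score in row
        ]
    return normalized


# Toy example: one gene, three species, self-hit bit score of 240.
profile = normalize_profile({"GENE_A": [30.0, 60.0, 120.0]}, {"GENE_A": 240.0})
```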

Taxonomy annotations of the species were retrieved from UniProt Taxonomy32 (accessed on 01.09.2019, https://www.uniprot.org/taxonomy/). The taxonomic lineage information was split into clades, and clades with fewer than 10 species were filtered out. Next, the clades were sorted by decreasing size and filtered by Jaccard similarity, so that a clade with a Jaccard similarity greater than 0.8 to a larger clade was considered redundant and filtered out. Both the redundant and non-redundant clade sets can be found in Supplementary Table 1 (non-redundant clades are marked in blue).
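
The size and redundancy filter can be sketched as a greedy pass over clades sorted by size. One detail is an assumption here: each clade is compared only with the larger clades that were already kept, which the text does not specify. All names are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty sets of species."""
    return len(a & b) / len(a | b)


def filter_clades(clades, min_size=10, max_jaccard=0.8):
    """Return names of non-redundant clades, largest first.

    clades: dict clade name -> set of species identifiers.
    Clades with fewer than `min_size` species are dropped. A clade is
    considered redundant if its Jaccard similarity to an already kept
    (larger) clade exceeds `max_jaccard`.
    """
    large_enough = [item for item in clades.items() if len(item[1]) >= min_size]
    ordered = sorted(large_enough, key=lambda item: len(item[1]), reverse=True)
    kept = []
    for name, species in ordered:
        if all(jaccard(species, clades[k]) <= max_jaccard for k in kept):
            kept.append(name)
    return kept


toy_clades = {
    "A": set(range(20)),        # kept (largest)
    "B": set(range(18)),        # Jaccard with A = 0.9, redundant
    "C": set(range(100, 112)),  # disjoint from A, kept
    "D": set(range(5)),         # fewer than 10 species, dropped
}
non_redundant = filter_clades(toy_clades)
```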

Pathways and gene sets were obtained from Reactome (Croft et al., 2011), the Kyoto Encyclopedia of Genes and Genomes (KEGG)30, CORUM31, and UniProt GOA58. Reactome pathways were downloaded as a GMT file with gene symbols (reactome.org/download/current/ReactomePathways.gmt.zip, accessed on 25.01.2019), together with pathway descriptions (reactome.org/download/current/ReactomePathways.txt, accessed on 05.02.2019). For the time split, the Reactome data were retrieved again on January 12, 2021. The pathway hierarchy was downloaded as an adjacency list (reactome.org/download/current/ReactomePathwaysRelation.txt, accessed on 05.02.2019). Reactome complexes were retrieved using the REST API and filtered to retain only protein components (queried on 12.02.2019). KEGG pathways were retrieved from MSigDB v6.2 as a GMT file with gene symbols (http://software.broadinstitute.org/gsea/downloads.jsp, accessed on 28.11.2018). CORUM complexes were downloaded from the website and converted to a GMT file with gene symbols (https://mips.helmholtz-muenchen.de/corum/download/allComplexes.txt.zip, accessed on 12.02.2019). UniProt GOA was downloaded as a GAF file (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz, accessed on 04.02.2019) and matched with Gene Ontology term descriptions59,60 (as OBO, http://geneontology.org/docs/download-ontology/, accessed on 04.02.2019) to generate GMT files.

A standardized preprocessing pipeline was applied to each pathway and gene set source. Each GMT file was filtered to gene sets containing at least three and at most 50 genes. Next, assuming that gene sets are fully connected (that is, considering all pairwise connections), an adjacency list was generated for each gene set. These adjacency lists were concatenated to generate a list of all gene pairs that co-occur in a pathway from each source. For Reactome, an additional step was taken to divide the pathways by their top-level pathways. The top-level pathways were filtered to retain those with at least 5000 gene pairs, leaving 12 top-level pathways. Similar top-level attributions were generated for GO terms, resulting in 28 pathway types: 17 GO BP (biological process), 6 GO CC (cellular component), and 5 GO MF (molecular function) terms.
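
The conversion of gene sets into co-occurring gene pairs can be sketched as below. The function name and the in-memory dictionary standing in for a parsed GMT file are assumptions made for illustration.

```python
from itertools import combinations


def gene_set_pairs(gene_sets, min_genes=3, max_genes=50):
    """Expand gene sets into fully connected gene pairs.

    gene_sets: dict gene set name -> list of gene symbols (e.g. one parsed
    GMT file). Sets outside [min_genes, max_genes] are skipped. Returns a
    dict mapping each sorted gene pair to the gene sets it co-occurs in.
    """
    pairs = {}
    for name, genes in gene_sets.items():
        unique = sorted(set(genes))
        if not (min_genes <= len(unique) <= max_genes):
            continue
        for a, b in combinations(unique, 2):
            pairs.setdefault((a, b), set()).add(name)
    return pairs


toy_sets = {
    "P1": ["A", "B", "C"],
    "P2": ["B", "C", "D"],
    "too_small": ["A", "D"],  # fewer than three genes, skipped
}
pairs = gene_set_pairs(toy_sets)
```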

Interaction data were obtained from IntAct29 and BioGrid28. IntAct complexes and BioGrid pairwise interactions were retrieved (as SIF files) from PathwayCommons61 (version 10, https://www.pathwaycommons.org/archives/PC2/v10/, accessed on 04.08.2018).

The machine learning models were trained in a multi-label manner using scikit-learn62 and lightGBM63. We compared decision trees, logistic regression, naive Bayes, and random forests (using lightGBM) and chose the best-performing one, lightGBM (Supplementary Figure 3). The lightGBM model was further tested against different positive-unlabeled frameworks for predicting hidden positive links (Supplementary "Methods"). For each label (functional interaction, i.e., co-occurrence in any pathway, or interaction context, i.e., co-occurrence in a specific pathway type or in a protein complex), a model was trained to predict the label from the covariance between the profiles of the gene pair, measured in each of the 49 clades (as described in Section 8.1). The models were trained with repeated five-fold cross-validation (CV) to evaluate performance. Each CV fold consisted of a random sample of known (positive) gene pairs matched with random negative pairs. The random negative pairs were chosen so that the number of genes in the negative pair set approximately matched the number of genes in the positive pair set, to preserve the topology. For each cross-validation fold, the training and test splits were stratified by gene content. Park and Marcotte27 showed the effect of the individual genes in an interaction pair when performing pairwise interaction prediction. Therefore, we performed a stratification similar to the one suggested in their original paper, designating 30% of the genes in the positive and negative pairs as genes exclusive to the test set. We then divided the test set into three parts: C1, in which both genes of a pair appear in some pair in the training set; C2, in which only one gene of the pair appears in the training set; and C3, in which neither gene appears in the training set. In addition, because paralogous genes have similar phylogenetic profiles and may act as a "data leak", each gene selected for the test set was grouped with all genes with high sequence similarity to it (bit score > 60), and the whole group was designated as exclusive to the test set in that CV fold. The final lightGBM random forest model was used to predict the probability of all ~200 million possible gene pairs for each label, both as an average across the CV folds and for each CV fold separately.
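
The per-clade covariance features used as model input can be sketched as follows. This is an illustration under stated assumptions: the sample covariance (dividing by n - 1) is used, clades are given as lists of species column indices, and all names are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def clade_features(profile_a, profile_b, clade_columns):
    """One feature per clade: covariance of two gene profiles in that clade.

    profile_a, profile_b: normalized profiles (one value per species)
    clade_columns: dict clade name -> list of species column indices
    """
    features = {}
    for clade, columns in clade_columns.items():
        values_a = [profile_a[j] for j in columns]
        values_b = [profile_b[j] for j in columns]
        features[clade] = covariance(values_a, values_b)
    return features


# Toy pair: lost and kept together in one clade, in opposite species in another.
gene_a = [1.0, 0.0, 1.0, 0.0]
gene_b = [1.0, 0.0, 0.0, 1.0]
features = clade_features(gene_a, gene_b, {"clade_1": [0, 1], "clade_2": [2, 3]})
```

In this toy case the pair has positive covariance in the first clade and negative covariance in the second, which is the kind of clade-specific signal a whole-tree measure would average away.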

Functional interaction prediction is a classic positive-unlabeled learning problem. Annotations of functional interactions include only confirmed true interactions; non-interacting pairs lack confirmation. Supervised learning can be used in this setting by treating unlabeled pairs as negatives; however, several methods have been developed to handle unlabeled data more appropriately21,22,23,24. We therefore simulated a scenario in which known positives were hidden (unlabeled) and examined how well different methods recover them. We compared four methods: (a) vanilla Light Gradient Boosting Machine (LGBM), a standard LGBM classifier with unlabeled pairs treated as negatives; (b) PUBag, positive-unlabeled bagging of LGBM trees, in which the positives are held constant and the unlabeled pairs are sampled for each model in the bagging process, inspired by the bagging SVM of Mordelet and Vert22; (c) AdaSample, an adaptive sampling method in which the negatives and positives for each classifier in the ensemble are chosen according to their probability of belonging to the class in the previous iteration; and (d) a random forest model, LGBM RF, which captures properties similar to the SVM ensemble described by Claesen et al.23. The base classifier for PUBag and AdaSample was a default LGBM classifier with 10 trees. For vanilla LGBM, a default LGBM classifier with 200 trees was used, and for LGBM RF we used the random forest mode ("boosting_type ='rf'") with 200 trees, subsampling of 0.5, feature sampling of 0.5, and deeper trees with a maximum of 128 leaves. Performance was measured by the area under the receiver operating characteristic curve (auROC) and by the distributions of predicted probabilities for positive, random negative, and hidden positive gene pairs. PUBag was run using the GitHub implementation by R. Wright64, with subsampling of 0.5 and feature sampling of 0.5. AdaSample was implemented by the authors in Python, based on the R implementation65,66, and run for two iterations with subsampling of 0.3 and an ensemble of 20 models. We cross-validated and compared the models at four different hidden fractions: 0.1, 0.3, 0.5, and 0.7. For 0.3, for example, 30% of the positive pairs in the training set were treated as negatives (Supplementary Figure 2).
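
The hidden-positive simulation can be sketched as a relabeling step applied to the training labels before fitting any of the compared models. The function name, the fixed seed, and the rounding of the number of hidden pairs are assumptions for illustration.

```python
import random


def hide_positives(labels, hidden_fraction, seed=0):
    """Relabel a random fraction of positive pairs as negatives.

    labels: list of 0/1 training labels (1 = known functional interaction)
    Returns the observed labels and the sorted indices of hidden positives,
    so that the recovery of hidden positives can be evaluated afterwards.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, y in enumerate(labels) if y == 1]
    n_hidden = round(len(positive_idx) * hidden_fraction)
    hidden = set(rng.sample(positive_idx, n_hidden))
    observed = [0 if i in hidden else y for i, y in enumerate(labels)]
    return observed, sorted(hidden)


true_labels = [1] * 10 + [0] * 10
observed, hidden = hide_positives(true_labels, hidden_fraction=0.3)
```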

We compared our method (described in Section 8.3) with four established phylogenetic analysis methods: normalized phylogenetic profiling (NPP)1,2, singular value decomposition of phylogenetic profiles (SVD-Phy)25, Hamming distance on binary phylogenetic profiles (BPP)9,26, and PrePhyloPro26 on binary phylogenetic profiles (BPP). For NPP, the matrix was prepared as described above (Section 8.1) and then further normalized by taking log2 of the matrix and standard-scaling the columns (species), i.e., subtracting the column mean and dividing by the column standard deviation. The similarity between genes was calculated using Pearson correlation. For SVD-Phy, the matrix was prepared as described above, and a truncated SVD was computed using the top 35% of components, as described in the original paper25. The similarity between genes was calculated as the Pearson correlation between the components. For the Hamming and PrePhyloPro matrices, the BLAST output was taken as E-values instead of bit scores and binarized by assigning 1 (ortholog present) to E-values smaller than 10−3 and 0 (ortholog lost) otherwise. For Hamming distance on binary profiles, the Hamming distance was used to calculate the similarity between genes. For PrePhyloPro, the similarity between genes was calculated using the rank of the Jaccard index between a single gene and all other genes. As described by Niu et al.26, whenever the Pearson correlation between the profiles was less than 0, the rank was considered last.
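
The binary-profile baselines can be illustrated with a short sketch of the binarization step and the two similarity measures. Representing a missing hit as an infinite E-value is an assumption, as are the function names; the ranking step of PrePhyloPro is not reproduced here.

```python
def binarize(e_values, threshold=1e-3):
    """1 if an ortholog is present (E-value below threshold), else 0."""
    return [1 if e < threshold else 0 for e in e_values]


def hamming_distance(a, b):
    """Number of species in which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_index(a, b):
    """Shared presences divided by species where either gene is present."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


no_hit = float("inf")  # assumed encoding of "no BLAST hit"
profile_a = binarize([1e-50, 1e-5, 0.5, no_hit])
profile_b = binarize([1e-20, 2.0, 1e-4, no_hit])
distance = hamming_distance(profile_a, profile_b)
similarity = jaccard_index(profile_a, profile_b)
```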

Our method was compared with the other PP methods on functional interactions in Reactome (Figure 1A, Supplementary Figure 5, Supplementary Table 4), the interaction context model for Reactome complexes (Figure 1B, Supplementary Figure 5, Supplementary Table 4), pathway types from Reactome (Supplementary Data 2), and pathway types from GO (Supplementary Data 2). Comparisons on "young genes" (Supplementary Figure 4), the Reactome time split (Supplementary Figure 6), external validation on various databases (Supplementary Figures 7-10), and pathway-level analysis (Figure 2, Supplementary Figure 11) are also shown.

To determine the centrality of genes in the network of a specific interaction context, we designed PathScore. PathScore is the stationary distribution of a random walk on the predicted interaction network of each label. Specifically, the predicted probabilities for a particular label are first sparsified by setting all values below the 75th percentile to 0. Each row is then divided by its sum (i.e., the node degree) to create a stochastic matrix. Next, the stationary distribution is calculated using the power method and then scaled to the range [0–1] to produce the PathScore. In addition to the PathScore value, its descending rank is also used; see, for example, Figure 4C. For each pathway type, the precision of PathScore at rank 100 was calculated, and pathway types with a precision below 10% were filtered out, resulting in 30 pathway types (Supplementary Data 3, Supplementary Figure 15).
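
The PathScore computation can be sketched in plain Python on a small dense matrix. Several details are assumptions not stated in the text: the nearest-rank percentile definition, the self-loop given to nodes left without edges after sparsification, the uniform starting vector, and the convergence tolerance. The function names are illustrative.

```python
def sparsify(weights, percentile=75):
    """Set all entries below the given percentile of the matrix to zero."""
    values = sorted(v for row in weights for v in row)
    # Nearest-rank percentile (the exact definition is an assumption).
    cutoff = values[min(len(values) - 1, int(len(values) * percentile / 100))]
    return [[v if v >= cutoff else 0.0 for v in row] for row in weights]


def stationary_distribution(weights, n_iter=1000, tol=1e-12):
    """Power method for the stationary distribution of a random walk."""
    n = len(weights)
    transition = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total == 0:
            # Node with no edges: keep the walker in place (assumption).
            transition.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            transition.append([v / total for v in row])
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new_pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(pi, new_pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


def path_score(weights):
    """Sparsify, compute the stationary distribution, scale to [0, 1]."""
    pi = stationary_distribution(sparsify(weights))
    lo, hi = min(pi), max(pi)
    return [(p - lo) / (hi - lo) if hi > lo else 0.0 for p in pi]


# Toy symmetric network of three genes.
toy_network = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0], [1.0, 2.0, 0.0]]
pi = stationary_distribution(toy_network)
scores = path_score(toy_network)
```

For a symmetric network such as this toy example, the stationary distribution is proportional to the weighted degree of each node, which gives a quick sanity check on the power iteration.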

Less studied genes were defined using neXtProt (Duek et al. 2018, accessed on 03.04.2019, bottom of the page at https://www.nextprot.org/about/human-proteome, or query ID NXQ_00022) and genes with few PubMed mentions according to gene2pubmed (accessed on 03.04.2019, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz). Both were filtered to genes whose gene symbols are present in the phylogenetic profile matrix and that have no functional annotation in UniProt. The gene2pubmed list was further filtered to keep only the bottom 20% by PubMed mentions, resulting in genes with 10 or fewer publication mentions.

Since the selected model is lightGBM, feature importance can be calculated using Shapley values, as implemented in the SHAP method for decision tree-based models41,42. The SHAP method uses the Shapley value, a game-theoretic method for attributing credit among agents in a collaborative process. For feature importance, each feature (here, the covariance in a particular clade) is treated as an agent, and each individual prediction is a collaborative process involving these features. A SHAP value can be positive or negative and can be read as "did this particular feature increase or decrease the predicted probability?". To calculate SHAP values, the "predict_proba(X, pred_contrib = True)" method of the lightGBM classifier object was used. Since the model is a random forest model, the SHAP values are returned as the sum of the probability contributions of each tree in the ensemble. Therefore, the SHAP values were averaged over the trees to obtain the contribution to the predicted probability of the ensemble.
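
The aggregation of per-prediction contributions into a clade importance score can be sketched as below. The sketch assumes a contribution matrix with one row per gene pair and one column per clade followed by a final baseline column (LightGBM appends the expected value as the last column of its contribution output); the function name and toy numbers are illustrative, and no model is trained here.

```python
def mean_abs_contribution(contributions, feature_names, n_trees=1):
    """Mean absolute per-feature contribution across predictions.

    contributions: rows of per-feature contributions; the final column is
    the baseline (expected value) and is ignored.
    n_trees: number of trees, used to turn contributions summed over a
    random forest into a per-tree average.
    """
    n_rows = len(contributions)
    importance = {}
    for j, name in enumerate(feature_names):
        total = sum(abs(row[j]) / n_trees for row in contributions)
        importance[name] = total / n_rows
    return importance


# Two toy predictions, two clade features, plus a baseline column.
toy_contributions = [[0.2, -0.1, 0.5], [-0.4, 0.3, 0.5]]
importance = mean_abs_contribution(toy_contributions, ["Fungi", "Nematoda"], n_trees=2)
```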

The clade importance projection was performed using the UMAP44 Python package with the parameters n_neighbors = 10, min_dist = 1, spread = 3. The UMAP projection to two dimensions was applied to the mean SHAP values of the 49 clades over the gene pairs in each Reactome pathway. A custom matplotlib script was then used to plot the result and create Figure 6B. For Supplementary Figure 17, the clade SHAP values of all gene pairs in the test set of the first cross-validation fold were used.

To analyze model performance by gene "age", genes were classified as having first appeared in chordates, in metazoans, or as all genes. This classification used the BPP matrix described above (Section 8.5), i.e., the phylogenetic profile matrix binarized by the BLAST E-value threshold. A gene was classified as metazoan-specific if, according to the BLAST E-value criterion, no species outside Metazoa had an ortholog of the gene, and likewise for chordates.
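
The age classification can be sketched as a set comparison on a binary profile. The label strings, the "other" category for genes with orthologs outside Metazoa, and the toy species names are assumptions made for illustration.

```python
def classify_gene_age(profile, metazoa, chordata):
    """Classify a gene by where its orthologs are found.

    profile: dict species -> 0/1 (binary presence of an ortholog)
    metazoa, chordata: sets of species names belonging to each clade
    """
    present = {species for species, value in profile.items() if value}
    if present and present <= chordata:
        return "chordate-specific"
    if present and present <= metazoa:
        return "metazoan-specific"
    return "other"


chordata = {"human", "zebrafish"}
metazoa = chordata | {"fly", "worm"}
young = classify_gene_age({"human": 1, "zebrafish": 1, "fly": 0, "yeast": 0}, metazoa, chordata)
middle = classify_gene_age({"human": 1, "fly": 1, "yeast": 0}, metazoa, chordata)
old = classify_gene_age({"human": 1, "yeast": 1}, metazoa, chordata)
```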

A list of parasitic species and clades was manually curated using annotations from The Encyclopedia of Life67 and GloBI68. The parasitic status of some species was further disambiguated manually. The list can be found in Supplementary Table 5. Based on clade size and number of parasites, six main parasite-containing clades were selected: Alveolata, Nematoda, Protophyte, Microsporidia, Platychondria, and Kinetoplastida. These clades contain both parasites and non-parasites (symbiotic or free-living organisms). For the gene enrichment analysis, genes were first filtered to those that are conserved across eukaryotes (found in more than 75% of all eukaryotes in the data) and not conserved in at least one parasitic clade (found in less than 25% of the parasites in a clade). Then, for each gene in each selected clade, the gene was considered found if it appeared in at least 50% of the clade, and not found otherwise. This produced a binary presence matrix. The intersections and the UpSet plot69 were generated using the UpSetPlot Python package70. GO term enrichment was performed using the clusterProfiler package in R71.
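
The gene filtering and presence calls described above can be sketched as follows, assuming binary per-species presence values are available for each gene. Function names, species identifiers, and clade names in the toy example are hypothetical.

```python
def fraction_present(profile, species):
    """Fraction of the given species in which the gene has an ortholog."""
    return sum(profile[s] for s in species) / len(species)


def lost_in_parasites(profile, eukaryotes, parasites_by_clade,
                      conserved_min=0.75, parasite_max=0.25):
    """Clades whose parasites have largely lost a broadly conserved gene.

    Returns [] if the gene is not found in more than `conserved_min` of all
    eukaryotes; otherwise lists clades where it is found in less than
    `parasite_max` of the parasites.
    """
    if fraction_present(profile, eukaryotes) <= conserved_min:
        return []
    return [clade for clade, species in parasites_by_clade.items()
            if fraction_present(profile, species) < parasite_max]


def clade_presence(profile, clade_species, min_fraction=0.5):
    """Binary call: 1 if the gene appears in at least half of the clade."""
    return 1 if fraction_present(profile, clade_species) >= min_fraction else 0


species = [f"s{i}" for i in range(10)]
toy_profile = {s: (1 if i < 8 else 0) for i, s in enumerate(species)}  # present in 80%
lost = lost_in_parasites(toy_profile, species,
                         {"clade_A": ["s8", "s9"], "clade_B": ["s0", "s1"]})
call = clade_presence(toy_profile, ["s6", "s7", "s8", "s9"])
```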

For the parasite exclusion analysis, models were retrained using all clades as described above, excluding clades that contain any parasites, and excluding all parasitic species as described above. Species were excluded by setting their values to NA. These models were trained to predict functional interactions in Reactome and were stratified similarly to the other analyses in the paper. The stratifications were based on the presence of genes in the test set (i.e., the Park-Marcotte-inspired split) and on the exclusion of paralogous pairs. The results are shown in Supplementary Table 6. The clade importance analysis was performed as described above (Section 8.7) and is shown in Supplementary Figure 18. Because the models were retrained, there are some differences between these models and the main models, such as slight changes in performance metrics and in clade importance attributions.

For more information on the research design, see the Nature Research Reporting Summary linked to this article.

The data can be explored through the accompanying web server: https://mlpp.cs.huji.ac.il. The web server allows users to explore predictions for individual genes and gene sets, as well as PathScore functional annotations of specific genes. Availability is further described in Supplementary Note 2. Gene pair and gene set predictions are available only through this web server. PathScore predictions are available through the web server and are attached as Supplementary Data 3. The phylogenetic profiles, models, and other raw data used to generate the analyses in this work can be found on Zenodo: https://doi.org/10.5281/zenodo.5111607. Source data are provided with this paper.

The code is available through the Github repository: https://github.com/dst1/MLPP. This includes training models and pipelines to produce predictions, as well as reproducing the example benchmarks in Figure 1A and Supplementary Figure 3.

Tabach, Y. et al. Identification of small RNA pathway genes using patterns of phylogenetic conservation and divergence. Nature 493, 694–698 (2013).

Tabach, Y. et al. Human disease locus discovery and mapping to molecular pathways through phylogenetic profiling. Mol. Syst. Biol. 9, 692 (2013).

Sherill-Rofe, D. et al. Mapping global and local coevolution across 600 species to identify novel homologous recombination repair genes. Genome Res. 29, 439–448 (2019).

Dey, G., Jaimovich, A., Collins, SR, Seki, A. & Meyer, T. Systematic discovery of human gene function and principles of modular organization through phylogenetic profiling. Cell Rep. 10, 993–1006 (2015).

Li, Y., Calvo, SE, Gutman, R., Liu, JS & Mootha, VK Expansion of biological pathways based on evolutionary inference. Cell 158, 213–225 (2014).

CAS Article PubMed PubMed Central Google Scholar 

Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. PNAS 96, 4285–4288 (1999).

Shin, J. & Lee, I. Co-inheritance analysis within the domains of life substantially improves network inference by phylogenetic profiling. PLoS ONE 10, e0139006 (2015).

Date, S. V. & Marcotte, E. M. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol. 21, 1055–1062 (2003).

Kensche, P. R., van Noort, V., Dutilh, B. E. & Huynen, M. A. Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J. R. Soc. Interface 5, 151–170 (2008).

Tsaban, T. et al. CladeOScope: functional interactions through the prism of clade-wise co-evolution. NAR Genom. Bioinform. 3, lqab024 (2021).

Avidor-Reiss, T. et al. Decoding cilia function: defining specialized genes required for compartmentalized cilia biogenesis. Cell 117, 527–539 (2004).

Baughman, J. M. et al. Integrative genomics identifies MCU as an essential component of the mitochondrial calcium uniporter. Nature 476, 341–345 (2011).

Škunca, N. & Dessimoz, C. Phylogenetic profiling: how much input data is enough? PLoS ONE 10, e0114701 (2015).

Dey, G. & Meyer, T. Phylogenetic profiling for probing the modular architecture of the human genome. Cell Syst. 1, 106–115 (2015).

Pandey, A. K., Lu, L., Wang, X., Homayouni, R. & Williams, R. W. Functionally enigmatic genes: a case study of the brain ignorome. PLoS ONE 9, e88889 (2014).

Stoeger, T., Gerlach, M., Morimoto, R. I. & Amaral, L. A. N. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 16, e2006643 (2018).

Haynes, W. A., Tomczak, A. & Khatri, P. Gene annotation bias impedes biomedical research. Sci. Rep. 8, 1362 (2018).

Duek, P., Gateau, A., Bairoch, A. & Lane, L. Exploring the uncharacterized human proteome using neXtProt. J. Proteome Res. https://doi.org/10.1021/acs.jproteome.8b00537 (2018).

Li, Y., Ning, S., Calvo, S. E., Mootha, V. K. & Liu, J. S. Bayesian hidden Markov tree models for clustering genes with shared evolutionary history. Ann. Stat. 46, 1721–1741 (2018).

Croft, D. et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39, D691–D697 (2011).

Elkan, C. & Noto, K. Learning classifiers from only positive and unlabeled data. In Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD 08 213 (ACM Press, 2008). https://doi.org/10.1145/1401890.1401920.

Mordelet, F. & Vert, J.-P. A bagging SVM to learn from positive and unlabeled examples. Pattern Recognit. Lett. 37, 201–209 (2014).

Claesen, M., De Smet, F., Suykens, J. A. K. & De Moor, B. A robust ensemble approach to learn from positive and unlabeled data using SVM base models. Neurocomputing 160, 73–84 (2015).

Yang, P. et al. AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications. IEEE Trans. Cybern. 1–12, https://doi.org/10.1109/TCYB.2018.2816984 (2018).

Franceschini, A. et al. SVD-phy: improved prediction of protein functional associations through singular value decomposition of phylogenetic profiles. Bioinformatics 32, 1085–1087 (2016).

Niu, Y., Liu, C., Moghimyfiroozabad, S., Yang, Y. & Alavian, K. N. PrePhyloPro: phylogenetic profile-based prediction of whole proteome linkages. PeerJ 5, e3712 (2017).

Park, Y. & Marcotte, E. M. Flaws in evaluation schemes for pair-input computational predictions. Nat. Methods 9, 1134–1136 (2012).

Chatr-aryamontri, A. et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 45, D369–D379 (2017).

Orchard, S. et al. The MIntAct project-IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363 (2014).

Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).

Giurgiu, M. et al. CORUM: the comprehensive resource of mammalian protein complexes-2019. Nucleic Acids Res. 47, D559–D563 (2019).

UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).

Sparks, J. L. et al. Human exonuclease 5 is a novel sliding exonuclease required for genome stability. J. Biol. Chem. 287, 42773–42783 (2012).

Wang, C. et al. C17orf53 is identified as a novel gene involved in inter-strand crosslink repair. DNA Repair 95, 102946 (2020).

Potts, P. R., Porteus, M. H. & Yu, H. Human SMC5/6 complex promotes sister chromatid homologous recombination by recruiting the SMC1/3 cohesin complex to double-strand breaks. EMBO J. 25, 3377–3388 (2006).

Otero, G. et al. Elongator, a multisubunit component of a novel RNA polymerase II holoenzyme for transcriptional elongation. Mol. Cell 3, 109–118 (1999).

Burrell, R. A. et al. Replication stress links structural and numerical cancer chromosomal instability. Nature 494, 492–496 (2013).

Yu, Y. et al. Proliferating cell nuclear antigen is protected from degradation by forming a complex with MutT Homolog2. J. Biol. Chem. 284, 19310–19320 (2009).

Olivieri, M. et al. A genetic map of the response to DNA damage in human cells. Cell 182, 481–496.e21 (2020).

Hosono, K., Sasaki, T., Minoshima, S. & Shimizu, N. Identification and characterization of a novel gene family YPEL in a wide spectrum of eukaryotic species. Gene 340, 31–43 (2004).

Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 30 (Curran Associates, Inc., 2017).

Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).

Cuomo, C. A. et al. Microsporidian genome analysis reveals evolutionary strategies for obligate intracellular growth. Genome Res. 22, 2478–2488 (2012).

McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 3, 861 (2018).

Corradi, N. Microsporidia: eukaryotic intracellular parasites shaped by gene loss and horizontal gene transfers. Annu. Rev. Microbiol. 69, 167–183 (2015).

Zarowiecki, M. & Berriman, M. What helminth genomes have taught us about parasite evolution. Parasitology 142, S85–S97 (2015).

Tsai, I. J. et al. The genomes of four tapeworm species reveal adaptations to parasitism. Nature 496, 57–63 (2013).

Coghlan, A. et al. Comparative genomics of the major parasitic worms. Nat. Genet. 51, 163–174 (2019).

Dyková, I., Fiala, I., Lom, J. & Lukeš, J. Perkinsiella amoebae-like endosymbionts of Neoparamoeba spp., relatives of Ichthyobodo. Eur. J. Protistol. 39, 37–52 (2003).

Barker, D., Meade, A. & Pagel, M. Constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes. Bioinformatics 23, 14–20 (2007).

Mitreva, M., Blaxter, M. L., Bird, D. M. & McCarter, J. P. Comparative genomics of nematodes. Trends Genet. 21, 573–581 (2005).

Parkinson, J. et al. A transcriptomic analysis of the phylum Nematoda. Nat. Genet. 36, 1259–1267 (2004).

O'Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).

Sadreyev, I. R., Ji, F., Cohen, E., Ruvkun, G. & Tabach, Y. PhyloGene server for identification and visualization of co-evolving proteins using normalized phylogenetic profiles. Nucleic Acids Res. 43, W154–W159 (2015).

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).

Enault, F., Suhre, K., Abergel, C., Poirot, O. & Claverie, J.-M. Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics 19, i105–i107 (2003).

Huntley, R. P. et al. The GOA database: Gene Ontology annotation updates for 2015. Nucleic Acids Res. 43, D1057–D1063 (2015).

Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019).

Cerami, E. G. et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 39, D685–D690 (2011).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2012).

Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 30 (Curran Associates, Inc., 2017).

Wright, R. pu_learning, GitHub repository, https://github.com/roywright/pu_learning (2017).

Stupp, D. AdaSampling, GitHub repository, https://github.com/dst1/AdaSampling (2018).

Yang, P. AdaSampling, GitHub repository, https://github.com/PYangLab/AdaSampling (2018).

Parr, C. S. et al. The Encyclopedia of Life v2: providing global access to knowledge about life on Earth. BDJ 2, e1079 (2014).

Poelen, J. H., Simons, J. D. & Mungall, C. J. Global Biotic Interactions: an open infrastructure to share and analyze species-interaction datasets. Ecol. Inform. 24, 148–159 (2014).

Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R. & Pfister, H. UpSet: visualization of intersecting sets. IEEE Trans. Vis. Comput. Graph. 20, 1983–1992 (2014).

Nothman, J. UpSetPlot, GitHub repository, https://github.com/jnothman/UpSetPlot (2019).

Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: J. Integr. Biol. 16, 284–287 (2012).

The authors thank Professor Hanah Margalit of the Hebrew University of Jerusalem for insightful comments on our methods and manuscript. We thank Dr. Alexandre Orthwein of McGill University for help in identifying candidate DNA repair genes and validating them in the literature.

Department of Developmental Biology and Cancer Research, Institute for Medical Research Israel-Canada, The Hebrew University of Jerusalem, 9112001, Jerusalem, Israel

Doron Stupp, Elad Sharon, Idit Bloch and Yuval Tabach

Department of Biomedical Informatics, Harvard University, Boston, Massachusetts, 02115, USA

Department of Statistics and Data Science, Hebrew University of Jerusalem, Jerusalem, 9190501, Israel

DS, OZ, and YT conceived and designed this study. DS, ES, and IB prepared the phylogenetic analysis matrix used in this study. DS developed the method and generated all the analyses described in the paper. DS, MZ, OZ, and YT wrote this paper. All authors discussed the results and reviewed the paper.

Correspondence to Or Zuk or Yuval Tabach.

The authors declare no competing interests.

Peer review information: Nature Communications thanks Anis Karimpour-Fard and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third-party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Stupp, D., Sharon, E., Bloch, I. et al. Co-evolution-based machine learning predicts functional interactions between human genes. Nat Commun 12, 6454 (2021). https://doi.org/10.1038/s41467-021-26792-w

DOI: https://doi.org/10.1038/s41467-021-26792-w

Nature Communications (Nat Commun) ISSN 2041-1723 (online)